Abstract

Building rules on top of ontologies is the ultimate goal of the logical layer of the Semantic 
Web. To this aim an ad-hoc mark-up language for this layer is currently under discussion. 
It is intended to follow the tradition of hybrid knowledge representation and reasoning 
systems such as $\mathcal{AL}$-log that integrates the description logic $\mathcal{ALC}$ and the function-free
Horn clausal language Datalog. In this paper we consider the problem of automating the 
acquisition of these rules for the Semantic Web. We propose a general framework for rule 
induction that adopts the methodological apparatus of Inductive Logic Programming and 
relies on the expressive and deductive power of $\mathcal{AL}$-log. The framework is valid whatever
the scope of induction (description vs. prediction) is. Yet, for illustrative purposes, we also 
discuss an instantiation of the framework which aims at description and turns out to be 
useful in Ontology Refinement. 

KEYWORDS: Inductive Logic Programming, Hybrid Knowledge Representation and Rea- 
soning Systems, Ontologies, Semantic Web 



1 Introduction 

During the last decade increasing attention has been paid to ontologies and their
role in Knowledge Engineering (Staab and Studer 2004). In the philosophical sense
we may refer to an ontology as a particular system of categories accounting for a
certain vision of the world. As such, this system does not depend on a particular
language: Aristotle's ontology is always the same, independently of the language
used to describe it. On the other hand, in its most prevalent use in Artificial
Intelligence, an ontology refers to an engineering artifact (more precisely, produced
according to the principles of Ontological Engineering (Gomez-Perez et al. 2004)),
constituted by a specific vocabulary used to describe a certain reality, plus a set
of explicit assumptions regarding the intended meaning of the vocabulary words.
This set of assumptions usually has the form of a first-order logical theory, where
vocabulary words appear as unary or binary predicate names, respectively called
concepts and relations. In the simplest case, an ontology describes a hierarchy of
concepts related by subsumption relationships; in more sophisticated cases, suitable
axioms are added in order to express other relationships between concepts and
to constrain their intended interpretation. The two readings of ontology described
above are indeed related to each other, but in order to solve the terminological
impasse the word conceptualization is used to refer to the philosophical reading, as
appears in the following definition, based on (Gruber 1993): An ontology is a formal
explicit specification of a shared conceptualization for a domain of interest. Among
other things, this definition emphasizes the fact that an ontology has to be
specified in a language that comes with a formal semantics. Only by using such a
formal approach do ontologies provide the machine-interpretable meaning of concepts
and relations that is expected when using an ontology-based approach. Among the
formalisms proposed by Ontological Engineering, the most widely used at present are
Description Logics (DLs) (Baader et al. 2003). Note that DLs are decidable fragments
of First Order Logic (FOL) that are incomparable with Horn Clausal Logic (HCL)
as regards the expressive power (Borgida 1996) and the semantics (Rosati 2005).

Figure 1. Architecture of the Semantic Web.



Ontology Engineering, notably its DL-based approach, is playing a relevant role
in the definition of the Semantic Web. The Semantic Web is the vision of the World
Wide Web enriched by machine-processable information which supports the user
in his tasks (Berners-Lee et al. 2001). The architecture of the Semantic Web is
shown in Figure 1. It consists of several layers, each of which is equipped with an
ad-hoc mark-up language. In particular, the design of the mark-up language for the
ontological layer, OWL^1, has been based on the very expressive DL $\mathcal{SHOIN}(\mathbf{D})$
(Horrocks et al. 2000; Horrocks et al. 2003). Whereas OWL is already undergoing
the standardization process at W3C, the debate around a unified language for
rules is still ongoing. Proposals like SWRL^2 extend OWL with constructs inspired
by Horn clauses in order to meet the primary requirement of the logical layer:
'to build rules on top of ontologies'. SWRL is intended to bridge the notorious
gaps between DLs and HCL in a way that is similar in spirit to hybridization
in Knowledge Representation and Reasoning (KR&R) systems such as $\mathcal{AL}$-log
(Donini et al. 1998). Generally speaking, hybrid systems are KR&R systems which
are constituted by two or more subsystems dealing with distinct portions of a single
knowledge base by performing specific reasoning procedures (Frisch and Cohn
1991). The motivation for investigating and developing such systems is to improve
on two basic features of KR&R formalisms, namely representational adequacy and
deductive power, while preserving the other crucial feature, i.e. decidability. In
particular, combining DLs with HCL can easily lead to undecidability if the interface
between them is not reduced (Levy and Rousset 1998). The hybrid system
$\mathcal{AL}$-log integrates $\mathcal{ALC}$ (Schmidt-Schauss and Smolka 1991) and Datalog (Ceri et al.
1990) by using $\mathcal{ALC}$ concept assertions essentially as type constraints on variables.
It has been very recently mentioned as the blueprint for well-founded Semantic Web
rule mark-up languages because its underlying form of integration (called safe)
assures semantic and computational advantages that SWRL, though more expressive
than $\mathcal{AL}$-log, currently cannot assure (Rosati 2005).

Defining rules (including the ones for the Semantic Web) has usually been considered
a demanding task from the viewpoint of Knowledge Engineering. It is often
supported by Machine Learning algorithms that can vary in the approaches. The
approach known under the name of Inductive Logic Programming (ILP) seems to be
promising for the case at hand due to the common roots with Logic Programming
(Flach and Lavrac 2002). ILP has been historically concerned with rule induction
from examples and background knowledge within the representation framework of
HCL and with the aim of prediction (Nienhuys-Cheng and de Wolf 1997). More
recently ILP has moved towards either different FOL fragments (e.g., DLs) or new
learning goals (e.g., description). In this paper we resort to the methodological
apparatus of ILP to define a general framework for learning rules on top of
ontologies for the Semantic Web within the KR&R framework of $\mathcal{AL}$-log. The framework
proposed is general in the sense that it is valid whatever the scope of induction
(description vs. prediction) is. For the sake of illustration we concentrate on an
instantiation of the framework for the case of description.

The paper is organized as follows. Section 2 introduces the basic notions of
$\mathcal{AL}$-log. Section 3 defines the framework for learning rules in $\mathcal{AL}$-log. Section 4
illustrates an instantiation of the framework. Section 5 concludes the paper with final
remarks. Appendix A clarifies the links between OWL and DLs.



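^1 http://www.w3.org/2004/OWL/

^2 http://www.w3.org/Submission/SWRL/

Table 1. Syntax and semantics of $\mathcal{ALC}$.

Construct                    Syntax                     Semantics

bottom (resp. top) concept   $\bot$ (resp. $\top$)      $\emptyset$ (resp. $\Delta^{\mathcal{I}}$)
atomic concept               $A$                        $A^{\mathcal{I}} \subseteq \Delta^{\mathcal{I}}$
role                         $R$                        $R^{\mathcal{I}} \subseteq \Delta^{\mathcal{I}} \times \Delta^{\mathcal{I}}$
individual                   $a$                        $a^{\mathcal{I}} \in \Delta^{\mathcal{I}}$

concept negation             $\neg C$                   $\Delta^{\mathcal{I}} \setminus C^{\mathcal{I}}$
concept conjunction          $C \sqcap D$               $C^{\mathcal{I}} \cap D^{\mathcal{I}}$
concept disjunction          $C \sqcup D$               $C^{\mathcal{I}} \cup D^{\mathcal{I}}$
value restriction            $\forall R.C$              $\{x \in \Delta^{\mathcal{I}} \mid \forall y\ (x,y) \in R^{\mathcal{I}} \rightarrow y \in C^{\mathcal{I}}\}$
existential restriction      $\exists R.C$              $\{x \in \Delta^{\mathcal{I}} \mid \exists y\ (x,y) \in R^{\mathcal{I}} \wedge y \in C^{\mathcal{I}}\}$

equivalence axiom            $C \equiv D$               $C^{\mathcal{I}} = D^{\mathcal{I}}$
subsumption axiom            $C \sqsubseteq D$          $C^{\mathcal{I}} \subseteq D^{\mathcal{I}}$

concept assertion            $a : C$                    $a^{\mathcal{I}} \in C^{\mathcal{I}}$
role assertion               $\langle a, b \rangle : R$ $(a^{\mathcal{I}}, b^{\mathcal{I}}) \in R^{\mathcal{I}}$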





2 Basics of $\mathcal{AL}$-log



The system $\mathcal{AL}$-log (Donini et al. 1998) integrates two KR&R systems: structural
and relational.



2.1 The structural subsystem 



The structural part $\Sigma$ is based on $\mathcal{ALC}$ (Schmidt-Schauss and Smolka 1991) and
allows for the specification of knowledge in terms of classes (concepts), binary
relations between classes (roles), and instances (individuals). Complex concepts can
be defined from atomic concepts and roles by means of constructors (see Table 1).
Also $\Sigma$ can state both is-a relations between concepts (axioms) and instance-of
relations between individuals (resp. couples of individuals) and concepts (resp. roles)
(assertions). An interpretation $\mathcal{I} = (\Delta^{\mathcal{I}}, \cdot^{\mathcal{I}})$ for $\Sigma$ consists of a domain
$\Delta^{\mathcal{I}}$ and a mapping function $\cdot^{\mathcal{I}}$. In particular, individuals are mapped to
elements of $\Delta^{\mathcal{I}}$ such that $a^{\mathcal{I}} \neq b^{\mathcal{I}}$ if $a \neq b$ (Unique Names Assumption (UNA)
(Reiter 1980)). If $\mathcal{O} \subseteq \Delta^{\mathcal{I}}$ and $\forall a \in \mathcal{O}: a^{\mathcal{I}} = a$, $\mathcal{I}$ is called an
$\mathcal{O}$-interpretation. Also $\Sigma$ represents many different interpretations, i.e. all its
models (Open World Assumption (OWA) (Baader et al. 2003)).



The main reasoning task for $\Sigma$ is the consistency check. This test is performed
with a tableau calculus that starts with the tableau branch $S = \Sigma$ and adds
assertions to $S$ by means of propagation rules such as

• $S \rightarrow_{\sqcup} S \cup \{s : D\}$ if

1. $s : C_1 \sqcup C_2$ is in $S$,

2. $D = C_1$ or $D = C_2$,

3. neither $s : C_1$ nor $s : C_2$ is in $S$

• $S \rightarrow_{\forall} S \cup \{t : C\}$ if

1. $s : \forall R.C$ is in $S$,

2. $sRt$ is in $S$,

3. $t : C$ is not in $S$

• $S \rightarrow_{\sqsubseteq} S \cup \{s : C' \sqcup D\}$ if

1. $C \sqsubseteq D$ is in $S$,

2. $s$ appears in $S$,

3. $C'$ is the NNF concept equivalent to $\neg C$,

4. $s : C' \sqcup D$ is not in $S$

• $S \rightarrow_{\bot} \{s : \bot\}$ if

1. $s : A$ and $s : \neg A$ are in $S$, or

2. $s : \neg\top$ is in $S$,

3. $s : \bot$ is not in $S$

until either a contradiction is generated or an interpretation satisfying $S$ can be
easily obtained from it.
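
The propagation rules above can be made concrete with a small sketch. The code below is not the calculus of Donini et al.; it is a simplified illustration that assumes concepts are already in negation normal form, that there are no terminological axioms (so the subsumption rule is omitted), and that the standard textbook rules for conjunction and existential restriction are added to the disjunction, value-restriction and clash rules listed in the text. The tuple encoding of concepts and the names `satisfiable`, `_n0`, `_n1`, ... are hypothetical.

```python
# Concepts (in NNF) are tuples: ("atom", "A"), ("not", "A"), ("and", C, D),
# ("or", C, D), ("all", "R", C), ("some", "R", C), ("top",), ("bottom",).
# Concept assertions are pairs (s, C); role assertions are triples (s, "R", t).

def satisfiable(concepts, roles, fresh=0):
    """Return True iff the assertions admit a clash-free completed branch."""
    concepts, roles = set(concepts), set(roles)
    # Clash rule: s : bottom, or both s : A and s : not A.
    for s, c in concepts:
        if c[0] == "bottom":
            return False
        if c[0] == "not" and (s, ("atom", c[1])) in concepts:
            return False
    # Deterministic rules: conjunction, value restriction, existential restriction.
    for s, c in concepts:
        if c[0] == "and":
            new = {(s, c[1]), (s, c[2])} - concepts
            if new:
                return satisfiable(concepts | new, roles, fresh)
        elif c[0] == "all":
            new = {(t, c[2]) for (u, r, t) in roles if u == s and r == c[1]} - concepts
            if new:
                return satisfiable(concepts | new, roles, fresh)
        elif c[0] == "some":
            witnessed = any(u == s and r == c[1] and (t, c[2]) in concepts
                            for (u, r, t) in roles)
            if not witnessed:
                t = f"_n{fresh}"  # fresh individual, assumed unused elsewhere
                return satisfiable(concepts | {(t, c[2])},
                                   roles | {(s, c[1], t)}, fresh + 1)
    # Disjunction rule: branch on D = C1 or D = C2.
    for s, c in concepts:
        if c[0] == "or" and (s, c[1]) not in concepts and (s, c[2]) not in concepts:
            return (satisfiable(concepts | {(s, c[1])}, roles, fresh)
                    or satisfiable(concepts | {(s, c[2])}, roles, fresh))
    return True  # no rule applies and no clash: the branch yields a model

A, B, C = ("atom", "A"), ("atom", "B"), ("atom", "C")
clash_by_disjunction = {("a", ("or", A, B)), ("a", ("not", "A")), ("a", ("not", "B"))}
clash_by_value = {("a", ("all", "R", C)), ("b", ("not", "C"))}
open_branch = {("a", ("or", A, B)), ("a", ("not", "A"))}
```

Without terminological axioms every rule application adds a strictly smaller concept or a new successor of bounded depth, so the recursion terminates without a blocking condition; with axioms a blocking strategy would be needed.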

2.2 The relational subsystem 

The relational part of $\mathcal{AL}$-log allows one to define Datalog^3 programs enriched
with constraints of the form $s : C$ where $s$ is either a constant or a variable, and $C$
is an $\mathcal{ALC}$-concept. Note that the usage of concepts as typing constraints applies
only to variables and constants that already appear in the clause. The symbol &
separates constraints from Datalog atoms in a clause.

Definition 1 

A constrained Datalog clause is an implication of the form

$\alpha_0 \leftarrow \alpha_1, \ldots, \alpha_m \,\&\, \gamma_1, \ldots, \gamma_n$

where $m \geq 0$, $n \geq 0$, $\alpha_i$ are Datalog atoms and $\gamma_j$ are constraints. A constrained
Datalog program $\Pi$ is a set of constrained Datalog clauses.

An $\mathcal{AL}$-log knowledge base $\mathcal{B}$ is the pair $\langle \Sigma, \Pi \rangle$ where $\Sigma$ is an $\mathcal{ALC}$ knowledge base
and $\Pi$ is a constrained Datalog program. For a knowledge base to be acceptable,
it must satisfy the following conditions:

• The set of Datalog predicate symbols appearing in $\Pi$ is disjoint from the
set of concept and role symbols appearing in $\Sigma$.

• The alphabet of constants in $\Pi$ coincides with the alphabet $\mathcal{O}$ of the
individuals in $\Sigma$. Furthermore, every constant in $\Pi$ appears also in $\Sigma$.

• For each clause in $\Pi$, each variable occurring in the constraint part occurs
also in the Datalog part.



^3 For the sake of brevity we assume the reader to be familiar with Datalog.
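
The three acceptability conditions are purely syntactic and can be checked clause by clause. The following sketch illustrates one way to do so; the data layout (`Var`, `Clause`), the function name `acceptability_violations` and the predicate `speaks` are assumptions made for the example, not part of $\mathcal{AL}$-log. For the second condition only the "every constant appears in $\Sigma$" part is tested.

```python
from typing import NamedTuple

class Var(NamedTuple):
    name: str            # constants are represented as plain strings

class Clause(NamedTuple):
    head: tuple          # (predicate, (term, ...))
    body: tuple          # tuple of (predicate, (term, ...))
    constraints: tuple   # tuple of (term, concept_name)

def acceptability_violations(program, dl_symbols, individuals):
    """Return a list of (condition_number, offending_items) pairs."""
    found = []
    for clause in program:
        atoms = (clause.head,) + tuple(clause.body)
        predicates = {pred for pred, _ in atoms}
        atom_terms = {t for _, args in atoms for t in args}
        constraint_terms = {t for t, _ in clause.constraints}
        # Condition 1: Datalog predicates are disjoint from concept/role names.
        shared = predicates & set(dl_symbols)
        if shared:
            found.append((1, sorted(shared)))
        # Condition 2: every constant of the program is an individual of Sigma.
        constants = {t for t in atom_terms | constraint_terms
                     if not isinstance(t, Var)}
        unknown = constants - set(individuals)
        if unknown:
            found.append((2, sorted(unknown)))
        # Condition 3: constrained variables also occur in the Datalog part.
        loose = {t for t in constraint_terms if isinstance(t, Var)} - atom_terms
        if loose:
            found.append((3, sorted(v.name for v in loose)))
    return found

X, Y, Z = Var("X"), Var("Y"), Var("Z")
symbols = {"MiddleEastCountry", "Language", "Hosts"}   # hypothetical Sigma vocabulary
people = {"IR", "ARM"}                                 # hypothetical individuals
safe = Clause(("q", (X,)), (("speaks", (X, Y)),),
              ((X, "MiddleEastCountry"), (Y, "Language")))
unsafe = Clause(("q", (X,)), (("speaks", (X, Y)),), ((Z, "Language"),))
```

Here `unsafe` constrains a variable `Z` that does not occur in any Datalog atom, which is exactly the situation the third condition rules out.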

These properties state a safe interaction between the structural and the relational
part of an $\mathcal{AL}$-log knowledge base, thus solving the semantic mismatch between
the OWA of $\mathcal{ALC}$ and the CWA of Datalog (Rosati 2005). This interaction is
also at the basis of a model-theoretic semantics for $\mathcal{AL}$-log. We call $\Pi_D$ the set of
Datalog clauses obtained from the clauses of $\Pi$ by deleting their constraints. We
define an interpretation $\mathcal{J}$ for $\mathcal{B}$ as the union of an $\mathcal{O}$-interpretation $\mathcal{I}_{\mathcal{O}}$ for $\Sigma$ (i.e.
an interpretation compliant with the unique names assumption) and an Herbrand
interpretation $\mathcal{I}_H$ for $\Pi_D$. An interpretation $\mathcal{J}$ is a model of $\mathcal{B}$ if $\mathcal{I}_{\mathcal{O}}$ is a model
of $\Sigma$, and for each ground instance $\alpha'_0 \leftarrow \alpha'_1, \ldots, \alpha'_m \,\&\, \gamma'_1, \ldots, \gamma'_n$ of each clause
$\alpha_0 \leftarrow \alpha_1, \ldots, \alpha_m \,\&\, \gamma_1, \ldots, \gamma_n$ in $\Pi$, either there exists one $\gamma'_i$, $i \in \{1, \ldots, n\}$, that
is not satisfied by $\mathcal{J}$, or $\alpha'_0 \leftarrow \alpha'_1, \ldots, \alpha'_m$ is satisfied by $\mathcal{J}$. The notion of logical
consequence paves the way to the definition of answer set for queries. Queries to
$\mathcal{AL}$-log knowledge bases are special cases of Definition 1. An answer to the query
$Q$ is a ground substitution $\sigma$ for the variables in $Q$. The answer $\sigma$ is correct w.r.t.
an $\mathcal{AL}$-log knowledge base $\mathcal{B}$ if $Q\sigma$ is a logical consequence of $\mathcal{B}$ ($\mathcal{B} \models Q\sigma$). The
answer set of $Q$ in $\mathcal{B}$ contains all the correct answers to $Q$ w.r.t. $\mathcal{B}$.

Reasoning for $\mathcal{AL}$-log knowledge bases is based on constrained SLD-resolution
(Donini et al. 1998), i.e. an extension of SLD-resolution to deal with constraints.
In particular, the constraints of the resolvent of a query $Q$ and a constrained
Datalog clause $E$ are recursively simplified by replacing couples of constraints $t : C$,
$t : D$ with the equivalent constraint $t : C \sqcap D$. The one-to-one mapping between
constrained SLD-derivations and the SLD-derivations obtained by ignoring the
constraints is exploited to extend known results for Datalog to $\mathcal{AL}$-log. Note that
in $\mathcal{AL}$-log a derivation of the empty clause with associated constraints does not
represent a refutation. It actually infers that the query is true in those models of $\mathcal{B}$
that satisfy its constraints. Therefore in order to answer a query it is necessary to
collect enough derivations ending with a constrained empty clause such that every
model of $\mathcal{B}$ satisfies the constraints associated with the final query of at least one
derivation.
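
The simplification step mentioned above, which replaces couples of constraints $t : C$, $t : D$ by $t : C \sqcap D$, can be sketched in a few lines. The representation of a constraint as a `(term, concept)` pair and of a conjunction as an `("and", C, D)` tuple is an assumption made for illustration only.

```python
def simplify_constraints(constraints):
    """Merge all constraints on the same term into a single conjunction."""
    merged = {}
    for term, concept in constraints:
        if term in merged:
            # t : C and t : D become t : C AND D
            merged[term] = ("and", merged[term], concept)
        else:
            merged[term] = concept
    return list(merged.items())  # insertion order of first occurrence is kept

simplified = simplify_constraints([("X", "C"), ("Y", "E"), ("X", "D")])
```

After simplification each term carries at most one constraint, so checking the constraints of a constrained empty clause reduces to one instance check per term.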

Definition 2 

Let $Q^{(0)}$ be a query $\leftarrow \beta_1, \ldots, \beta_m \,\&\, \gamma_1, \ldots, \gamma_n$ to an $\mathcal{AL}$-log knowledge base $\mathcal{B}$. A
constrained SLD-refutation for $Q^{(0)}$ in $\mathcal{B}$ is a finite set $\{d_1, \ldots, d_s\}$ of constrained
SLD-derivations for $Q^{(0)}$ in $\mathcal{B}$ such that:

1. for each derivation $d_i$, $1 \leq i \leq s$, the last query $Q^{(n_i)}$ of $d_i$ is a constrained
empty clause;

2. for every model $\mathcal{J}$ of $\mathcal{B}$, there exists at least one derivation $d_i$, $1 \leq i \leq s$,
such that $\mathcal{J} \models Q^{(n_i)}$.

Constrained SLD-refutation is a complete and sound method for answering ground
queries (Donini et al. 1998).



Lemma 1

Let $Q$ be a ground query to an $\mathcal{AL}$-log knowledge base $\mathcal{B}$. It holds that $\mathcal{B} \vdash Q$ if
and only if $\mathcal{B} \models Q$.

An answer $\sigma$ to a query $Q$ is a computed answer if there exists a constrained
SLD-refutation for $Q\sigma$ in $\mathcal{B}$ ($\mathcal{B} \vdash Q\sigma$). The set of computed answers is called the
success set of $Q$ in $\mathcal{B}$. Furthermore, given any query $Q$, the success set of $Q$ in
$\mathcal{B}$ coincides with the answer set of $Q$ in $\mathcal{B}$. This provides an operational means
for computing correct answers to queries. Indeed, it is straightforward to see that
the usual reasoning methods for Datalog allow us to collect in a finite number
of steps enough constrained SLD-derivations for $Q$ in $\mathcal{B}$ to construct a refutation,
if any. Derivations must satisfy both conditions of Definition 2. In particular, the
latter requires some reasoning on the structural component of $\mathcal{B}$. This is done by
applying the tableau calculus as shown in the following example.

Constrained SLD-resolution is decidable (Donini et al. 1998). Furthermore,
because of the safe interaction between $\mathcal{ALC}$ and Datalog, it supports a form of
closed world reasoning, i.e. it allows one to pose queries under the assumption that
part of the knowledge base is complete (Rosati 2005).



3 The general framework for learning rules in $\mathcal{AL}$-log

In our framework for learning in $\mathcal{AL}$-log we represent inductive hypotheses as
constrained Datalog clauses and data as an $\mathcal{AL}$-log knowledge base $\mathcal{B}$. In particular $\mathcal{B}$
is composed of a background knowledge $\mathcal{K}$ and a set $O$ of observations. We assume
$\mathcal{K} \cap O = \emptyset$.

To define the framework we resort to the methodological apparatus of ILP which 
requires the following ingredients to be chosen: 

• the language $\mathcal{L}$ of hypotheses

• a generality order $\succeq$ for $\mathcal{L}$ to structure the space of hypotheses

• a relation to test the coverage of hypotheses in $\mathcal{L}$ against observations in $O$
w.r.t. $\mathcal{K}$

The framework is general, meaning that it is valid whatever the scope of
induction (description/prediction) is. Therefore the Datalog literal $q(\vec{X})$^4 in the
head of hypotheses represents a concept to be either discriminated from others
(discriminant induction) or characterized (characteristic induction).



This section collects and upgrades theoretical results published in (Lisi and
Malerba 2003a; Lisi and Malerba 2003b; Lisi and Esposito 2004).



3.1 The language of hypotheses 

To be suitable as language of hypotheses, constrained Datalog clauses must satisfy
the following restrictions.

First, we impose constrained Datalog clauses to be linked and connected (or
range-restricted) as usual in ILP (Nienhuys-Cheng and de Wolf 1997).



^4 $\vec{X}$ is a tuple of variables

Definition 3

Let $H$ be a constrained Datalog clause. A term $t$ in some literal $l_i \in H$ is linked
with linking-chain of length 0, if $t$ occurs in $head(H)$, and is linked with
linking-chain of length $d + 1$, if some other term in $l_i$ is linked with linking-chain of length $d$.
The link-depth of a term $t$ in some $l_i \in H$ is the length of the shortest linking-chain
of $t$. A literal $l_i \in H$ is linked if at least one of its terms is linked. The clause $H$
itself is linked if each $l_i \in H$ is linked. The clause $H$ is connected if each variable
occurring in $head(H)$ also occurs in $body(H)$.
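
Definition 3 lends itself to a simple fixpoint computation. The sketch below is an illustration only: it assumes atoms encoded as `(predicate, (term, ...))` pairs, considers the Datalog atoms of the clause (constraint variables occur there anyway by the safeness condition), and uses hypothetical predicate names `p`, `r`, `s`.

```python
from typing import NamedTuple

class Var(NamedTuple):
    name: str

def link_depths(head, body):
    """Map each linked term to the length of its shortest linking-chain."""
    depth = {term: 0 for term in head[1]}        # head terms have length 0
    changed = True
    while changed:
        changed = False
        for _, args in body:
            known = [depth[t] for t in args if t in depth]
            if not known:
                continue                          # no linked term here (yet)
            candidate = min(known) + 1
            for t in args:
                if depth.get(t, candidate + 1) > candidate:
                    depth[t] = candidate          # shorter chain found
                    changed = True
    return depth

def is_linked(head, body):
    depth = link_depths(head, body)
    return all(any(t in depth for t in args) for _, args in body)

def is_connected(head, body):
    head_vars = {t for t in head[1] if isinstance(t, Var)}
    body_vars = {t for _, args in body for t in args if isinstance(t, Var)}
    return head_vars <= body_vars

X, Y, Z, W = Var("X"), Var("Y"), Var("Z"), Var("W")
head = ("q", (X,))
chained = [("p", (X, Y)), ("r", (Y, Z))]   # q(X) <- p(X,Y), r(Y,Z)
dangling = [("p", (X, Y)), ("s", (W,))]    # s(W) shares no term with the rest
```

In `chained` the variable `Z` has link-depth 2 because it is reached from the head through `Y`; in `dangling` the literal `s(W)` has no linked term, so the clause is not linked.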

Second, we impose constrained Datalog clauses to be compliant with the bias
of Object Identity (OI) (Semeraro et al. 1998). This bias can be considered as
an extension of the UNA from the semantic level to the syntactic one of $\mathcal{AL}$-log.
We would like to remind the reader that this assumption holds in $\mathcal{ALC}$. Also it
holds naturally for ground constrained Datalog clauses because the semantics of
$\mathcal{AL}$-log adopts Herbrand models for the Datalog part and $\mathcal{O}$-models for the
constraint part. Conversely it is not guaranteed in the case of non-ground constrained
Datalog clauses, e.g. different variables can be unified. The OI bias can be the
starting point for the definition of either an equational theory or a quasi-order
for constrained Datalog clauses. The latter option relies on a restricted form of
substitution whose bindings avoid the identification of terms.

Definition 4 

A substitution $\sigma$ is an OI-substitution w.r.t. a set of terms $T$ iff $\forall t_1, t_2 \in T$: $t_1 \neq t_2$
yields that $t_1\sigma \neq t_2\sigma$.

From now on, we assume that substitutions are OI-compliant.
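
Since Datalog terms are either variables or constants, Definition 4 amounts to requiring that the substitution is injective on $T$. A minimal sketch, assuming a substitution is a dictionary from variable names to terms and that unbound terms are left unchanged (the function name is hypothetical):

```python
def is_oi_substitution(theta, terms):
    """True iff distinct terms in `terms` remain distinct after applying theta."""
    distinct = set(terms)
    images = {theta.get(t, t) for t in distinct}   # flat terms: lookup suffices
    return len(images) == len(distinct)

keeps_apart = is_oi_substitution({"X": "a", "Y": "b"}, ["X", "Y"])
merges_vars = is_oi_substitution({"X": "a", "Y": "a"}, ["X", "Y"])
merges_const = is_oi_substitution({"X": "a"}, ["X", "a"])
```

The last call shows that binding a variable to a constant already present in $T$ also violates the bias, not only the unification of two variables.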



3.2 The generality relation 

In ILP the key mechanism is generalization intended as a search process through a
partially ordered space of hypotheses (Mitchell 1982). The definition of a generality
relation for constrained Datalog clauses can disregard neither the peculiarities of
$\mathcal{AL}$-log nor the methodological apparatus of ILP. Therefore we rely on the
reasoning mechanisms made available by $\mathcal{AL}$-log knowledge bases and propose to adapt
Buntine's generalized subsumption (Buntine 1988) to our framework as follows.

Definition 5

Let $H$ be a constrained Datalog clause, $\alpha$ a ground Datalog atom, and $\mathcal{J}$ an
interpretation. We say that $H$ covers $\alpha$ under $\mathcal{J}$ if there is a ground substitution $\theta$
for $H$ ($H\theta$ is ground) such that $body(H)\theta$ is true under $\mathcal{J}$ and $head(H)\theta = \alpha$.

Definition 6

Let $H_1$, $H_2$ be two constrained Datalog clauses and $\mathcal{B}$ an $\mathcal{AL}$-log knowledge base.
We say that $H_1$ $\mathcal{B}$-subsumes $H_2$ if for every model $\mathcal{J}$ of $\mathcal{B}$ and every ground atom
$\alpha$ such that $H_2$ covers $\alpha$ under $\mathcal{J}$, we have that $H_1$ covers $\alpha$ under $\mathcal{J}$.

We can define a generality relation $\succeq_{\mathcal{B}}$ for constrained Datalog clauses on the
basis of $\mathcal{B}$-subsumption. It can be easily proven that $\succeq_{\mathcal{B}}$ is a quasi-order (i.e. it is
a reflexive and transitive relation) for constrained Datalog clauses.

Definition 7

Let $H_1$, $H_2$ be two constrained Datalog clauses and $\mathcal{B}$ an $\mathcal{AL}$-log knowledge base.
We say that $H_1$ is at least as general as $H_2$ under $\mathcal{B}$-subsumption, $H_1 \succeq_{\mathcal{B}} H_2$, iff $H_1$
$\mathcal{B}$-subsumes $H_2$. Furthermore, $H_1$ is more general than $H_2$ under $\mathcal{B}$-subsumption,
$H_1 \succ_{\mathcal{B}} H_2$, iff $H_1 \succeq_{\mathcal{B}} H_2$ and $H_2 \not\succeq_{\mathcal{B}} H_1$. Finally, $H_1$ is equivalent to $H_2$ under
$\mathcal{B}$-subsumption, $H_1 \sim_{\mathcal{B}} H_2$, iff $H_1 \succeq_{\mathcal{B}} H_2$ and $H_2 \succeq_{\mathcal{B}} H_1$.

The next lemma shows the definition of $\mathcal{B}$-subsumption to be equivalent to
another formulation, which will be more convenient in later proofs than the definition
based on covering.

Definition 8

Let $\mathcal{B}$ be an $\mathcal{AL}$-log knowledge base and $H$ be a constrained Datalog clause. Let
$X_1, \ldots, X_n$ be all the variables appearing in $H$, and $a_1, \ldots, a_n$ be distinct constants
(individuals) not appearing in $\mathcal{B}$ or $H$. Then the substitution $\{X_1/a_1, \ldots, X_n/a_n\}$
is called a Skolem substitution for $H$ w.r.t. $\mathcal{B}$.
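
Building a Skolem substitution only requires a supply of constants that are fresh with respect to $\mathcal{B}$ and $H$. The sketch below assumes variables are given by name and that the constants already in use are passed in explicitly; the naming scheme `sk1`, `sk2`, ... is an arbitrary choice for the example.

```python
from itertools import count

def skolem_substitution(clause_vars, used_constants, prefix="sk"):
    """Bind each distinct variable to a distinct constant not in used_constants."""
    fresh = (f"{prefix}{i}" for i in count(1))   # never yields the same name twice
    sigma = {}
    for var in dict.fromkeys(clause_vars):       # drop repeats, keep order
        constant = next(fresh)
        while constant in used_constants:
            constant = next(fresh)
        sigma[var] = constant
    return sigma

sigma = skolem_substitution(["X", "Y", "X"], {"sk1", "IR"})
```

Because the generator is shared across variables, the constants are pairwise distinct, so the result is also an OI-substitution in the sense of Definition 4.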

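Lemma 2

Let $H_1$, $H_2$ be two constrained Datalog clauses, $\mathcal{B}$ an $\mathcal{AL}$-log knowledge base, and
$\sigma$ a Skolem substitution for $H_2$ with respect to $\{H_1\} \cup \mathcal{B}$. Then $H_1 \succeq_{\mathcal{B}} H_2$ iff
there exists a ground substitution $\theta$ for $H_1$ such that (i) $head(H_1)\theta = head(H_2)\sigma$
and (ii) $\mathcal{B} \cup body(H_2)\sigma \models body(H_1)\theta$.

Proof

($\Rightarrow$) Suppose $H_1 \succeq_{\mathcal{B}} H_2$. Let $\mathcal{B}'$ be the knowledge base $\mathcal{B} \cup body(H_2)\sigma$ and $\mathcal{J} =
\langle \mathcal{I}_{\mathcal{O}}, \mathcal{I}_H \rangle$ be a model of $\mathcal{B}'$ where $\mathcal{I}_{\mathcal{O}}$ is the minimal $\mathcal{O}$-model of $\Sigma$ and $\mathcal{I}_H$ is the
least Herbrand model of the Datalog part of $\mathcal{B}'$. The substitution $\sigma$ is a ground
substitution for $H_2$, and $body(H_2)\sigma$ is true under $\mathcal{J}$, so $H_2$ covers $head(H_2)\sigma$ under
$\mathcal{J}$ by Definition 5. Then $H_1$ must also cover $head(H_2)\sigma$ under $\mathcal{J}$. Thus there is a
ground substitution $\theta$ for $H_1$ such that $head(H_1)\theta = head(H_2)\sigma$, and $body(H_1)\theta$
is true under $\mathcal{J}$, i.e. $\mathcal{J} \models body(H_1)\theta$. By properties of the least Herbrand model,
it holds that $\mathcal{B} \cup body(H_2)\sigma \models \mathcal{J}$, hence $\mathcal{B} \cup body(H_2)\sigma \models body(H_1)\theta$.

($\Leftarrow$) Suppose there is a ground substitution $\theta$ for $H_1$, such that $head(H_1)\theta =
head(H_2)\sigma$ and $\mathcal{B} \cup body(H_2)\sigma \models body(H_1)\theta$. Let $\alpha$ be some ground atom and $\mathcal{J}_\alpha$
some model of $\mathcal{B}$ such that $H_2$ covers $\alpha$ under $\mathcal{J}_\alpha$. To prove that $H_1 \succeq_{\mathcal{B}} H_2$ we
need to prove that $H_1$ covers $\alpha$ under $\mathcal{J}_\alpha$.

Construct a substitution $\theta'$ from $\theta$ as follows: for every binding $X/c \in \sigma$, replace
$c$ in bindings in $\theta$ by $X$. Then we have $H_1\theta'\sigma = H_1\theta$ and none of the Skolem
constants of $\sigma$ occurs in $\theta'$. Then $head(H_1)\theta'\sigma = head(H_1)\theta = head(H_2)\sigma$, so
$head(H_1)\theta' = head(H_2)$. Since $H_2$ covers $\alpha$ under $\mathcal{J}_\alpha$, there is a ground
substitution $\gamma$ for $H_2$, such that $body(H_2)\gamma$ is true under $\mathcal{J}_\alpha$, and $head(H_2)\gamma = \alpha$. This
implies that $head(H_1)\theta'\gamma = head(H_2)\gamma = \alpha$.

It remains to show that $body(H_1)\theta'\gamma$ is true under $\mathcal{J}_\alpha$. Since $\mathcal{B} \cup body(H_2)\sigma \models
body(H_1)\theta'\sigma$ and $\leftarrow body(H_1)\theta'\sigma$ is a ground query, it follows from Lemma 1 that
there exists a constrained SLD-refutation for $\leftarrow body(H_1)\theta'\sigma$ in $\mathcal{B} \cup body(H_2)\sigma$. By
Definition 2 there exists a finite set $\{d_1, \ldots, d_m\}$ of constrained SLD-derivations,
having $\leftarrow body(H_1)\theta'\sigma$ as top clause and elements of $\mathcal{B} \cup body(H_2)\sigma$ as input
clauses, such that for each derivation $d_i$, $i \in \{1, \ldots, m\}$, the last query $Q^{(n_i)}$ of
$d_i$ is a constrained empty clause and for every model $\mathcal{J}$ of $\mathcal{B} \cup body(H_2)\sigma$, there
exists at least one derivation $d_i$, $i \in \{1, \ldots, m\}$, such that $\mathcal{J} \models Q^{(n_i)}$. We want
to turn this constrained SLD-refutation for $\leftarrow body(H_1)\theta'\sigma$ in $\mathcal{B} \cup body(H_2)\sigma$ into
a constrained SLD-refutation for $\leftarrow body(H_1)\theta'\gamma$ in $\mathcal{B} \cup body(H_2)\gamma$, thus proving
that $\mathcal{B} \cup body(H_2)\gamma \models body(H_1)\theta'\gamma$. Let $X_1, \ldots, X_n$ be the variables in $body(H_2)$
such that $\{X_1/c_1, \ldots, X_n/c_n\} \subseteq \sigma$, and $\{X_1/t_1, \ldots, X_n/t_n\} \subseteq \gamma$. If we replace
each Skolem constant $c_j$ by $t_j$, $1 \leq j \leq n$, in both the SLD-derivations and the
models of $\mathcal{B} \cup body(H_2)\sigma$ we obtain a constrained SLD-refutation of $body(H_1)\theta'\gamma$
in $\mathcal{B} \cup body(H_2)\gamma$. Hence $\mathcal{B} \cup body(H_2)\gamma \models body(H_1)\theta'\gamma$. Since $\mathcal{J}_\alpha$ is a model of
$\mathcal{B} \cup body(H_2)\gamma$, it is also a model of $body(H_1)\theta'\gamma$.
□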

The relation between $\mathcal{B}$-subsumption and constrained SLD-resolution is given
below. It provides an operational means for checking $\mathcal{B}$-subsumption.

Theorem 1

Let $H_1$, $H_2$ be two constrained Datalog clauses, $\mathcal{B}$ an $\mathcal{AL}$-log knowledge base, and
$\sigma$ a Skolem substitution for $H_2$ with respect to $\{H_1\} \cup \mathcal{B}$. Then $H_1 \succeq_{\mathcal{B}} H_2$
iff there exists a substitution $\theta$ for $H_1$ such that (i) $head(H_1)\theta = head(H_2)$ and (ii)
$\mathcal{B} \cup body(H_2)\sigma \vdash body(H_1)\theta\sigma$ where $body(H_1)\theta\sigma$ is ground.

Proof

By Lemma 2, we have $H_1 \succeq_{\mathcal{B}} H_2$ iff there exists a ground substitution $\theta'$ for $H_1$,
such that $head(H_1)\theta' = head(H_2)\sigma$ and $\mathcal{B} \cup body(H_2)\sigma \models body(H_1)\theta'$. Since $\sigma$ is
a Skolem substitution, we can define a substitution $\theta$ such that $H_1\theta\sigma = H_1\theta'$ and
none of the Skolem constants of $\sigma$ occurs in $\theta$. Then $head(H_1)\theta = head(H_2)$ and
$\mathcal{B} \cup body(H_2)\sigma \models body(H_1)\theta\sigma$. Since $body(H_1)\theta\sigma$ is ground, by Lemma 1 we have
$\mathcal{B} \cup body(H_2)\sigma \vdash body(H_1)\theta\sigma$, so the thesis follows. □

The decidability of $\mathcal{B}$-subsumption follows from the decidability of both
generalized subsumption in Datalog (Buntine 1988) and query answering in $\mathcal{AL}$-log
(Donini et al. 1998).
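
To see how the test of Theorem 1 operates, the following sketch implements it for the purely relational case. It makes strong simplifying assumptions that should be kept in mind: constraints are ignored altogether (so it reduces to generalized subsumption for plain Datalog), the OI bias is not enforced, derivability is checked by naive bottom-up evaluation instead of constrained SLD-resolution, and the Skolem constants `sk_<name>` are assumed not to occur in the knowledge base. All predicate names are hypothetical.

```python
from typing import NamedTuple

class Var(NamedTuple):
    name: str

def match(atom, fact, theta):
    """Extend theta so that atom under theta equals fact, or return None."""
    (pred, args), (fpred, fargs) = atom, fact
    if pred != fpred or len(args) != len(fargs):
        return None
    theta = dict(theta)
    for a, f in zip(args, fargs):
        if isinstance(a, Var):
            if theta.setdefault(a, f) != f:
                return None
        elif a != f:
            return None
    return theta

def solutions(atoms, facts, theta):
    """Yield every extension of theta that maps all atoms into facts."""
    if not atoms:
        yield theta
        return
    for fact in facts:
        extended = match(atoms[0], fact, theta)
        if extended is not None:
            yield from solutions(atoms[1:], facts, extended)

def least_model(facts, rules):
    """Naive bottom-up evaluation of safe Datalog rules given as (head, body)."""
    model = set(facts)
    while True:
        new = set()
        for (pred, args), body in rules:
            for theta in solutions(list(body), model, {}):
                new.add((pred, tuple(theta.get(a, a) for a in args)))
        if new <= model:
            return model
        model |= new

def subsumes(h1, h2, facts, rules):
    """Check whether h1 is at least as general as h2 (clauses are (head, body))."""
    vars2 = {a for _, args in (h2[0], *h2[1]) for a in args if isinstance(a, Var)}
    sigma = {v: f"sk_{v.name}" for v in vars2}            # Skolem substitution
    ground = lambda atom: (atom[0], tuple(sigma.get(a, a) for a in atom[1]))
    model = least_model(set(facts) | {ground(b) for b in h2[1]}, rules)
    start = match(h1[0], ground(h2[0]), {})               # condition (i)
    return start is not None and any(                     # condition (ii)
        True for _ in solutions(list(h1[1]), model, start))

C, L, P = Var("C"), Var("L"), Var("P")
rules = [(("uses", (C, L)), (("language", (C, L, P)),))]  # uses(C,L) <- language(C,L,P)
general = (("q", (C,)), (("uses", (C, L)),))              # q(C) <- uses(C,L)
specific = (("q", (C,)), (("language", (C, L, P)),))      # q(C) <- language(C,L,P)
```

With the background rule, `general` subsumes `specific` but not vice versa, which illustrates how the knowledge base shapes the generality order.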



3.3 Coverage relations 



When defining coverage relations we make assumptions as regards the
representation of observations because it impacts the definition of coverage. In ILP there are
two choices: we can represent an observation as either a ground definite clause or a
set of ground unit clauses. The former is peculiar to the normal ILP setting (also
called learning from implications) (Frazier and Page 1993), whereas the latter is
usual in the logical setting of learning from interpretations (De Raedt and Dzeroski
1994). The representation choice for observations and the scope of induction are
orthogonal dimensions as clearly explained in (De Raedt 1997). Therefore we prefer
the term 'observation' to the term 'example' for the sake of generality.

In the logical setting of learning from entailment extended to $\mathcal{AL}$-log, an
observation $o_i \in O$ is represented as a ground constrained Datalog clause having a
ground atom $q(\vec{a}_i)$^5 in the head.

Definition 9

Let $H \in \mathcal{L}$ be a hypothesis, $\mathcal{K}$ a background knowledge and $o_i \in O$ an observation.
We say that $H$ covers $o_i$ under entailment w.r.t. $\mathcal{K}$ iff $\mathcal{K} \cup H \models o_i$.

In order to provide an operational means for testing this coverage relation we
resort to the Deduction Theorem for first-order logic formulas (Nienhuys-Cheng
and de Wolf 1997).

Theorem 2

Let $\Sigma$ be a set of formulas, and $\phi$ and $\psi$ be formulas. Then $\Sigma \cup \{\phi\} \models \psi$ iff
$\Sigma \models (\phi \rightarrow \psi)$.

Theorem 3

Let $H \in \mathcal{L}$ be a hypothesis, $\mathcal{K}$ a background knowledge, and $o_i \in O$ an observation.
Then $H$ covers $o_i$ under entailment w.r.t. $\mathcal{K}$ iff $\mathcal{K} \cup body(o_i) \cup H \vdash q(\vec{a}_i)$.

Proof

The following chain of equivalences holds:

• $H$ covers $o_i$ under entailment w.r.t. $\mathcal{K}$ $\Leftrightarrow$ (by Definition 9)

• $\mathcal{K} \cup H \models q(\vec{a}_i) \leftarrow body(o_i)$ $\Leftrightarrow$ (by Theorem 2)

• $\mathcal{K} \cup H \cup body(o_i) \models q(\vec{a}_i)$ $\Leftrightarrow$ (by Lemma 1)

• $\mathcal{K} \cup body(o_i) \cup H \vdash q(\vec{a}_i)$

□

In the logical setting of learning from interpretations extended to $\mathcal{AL}$-log, an
observation $o_i \in O$ is represented as a couple $(q(\vec{a}_i), \mathcal{A}_i)$ where $\mathcal{A}_i$ is a set containing
ground Datalog facts concerning the individual $i$.

Definition 10

Let $H \in \mathcal{L}$ be a hypothesis, $\mathcal{K}$ a background knowledge and $o_i \in O$ an observation.
We say that $H$ covers $o_i$ under interpretations w.r.t. $\mathcal{K}$ iff $\mathcal{K} \cup \mathcal{A}_i \cup H \models q(\vec{a}_i)$.

Theorem 4

Let $H \in \mathcal{L}$ be a hypothesis, $\mathcal{K}$ a background knowledge, and $o_i \in O$ an observation.
Then $H$ covers $o_i$ under interpretations w.r.t. $\mathcal{K}$ iff $\mathcal{K} \cup \mathcal{A}_i \cup H \vdash q(\vec{a}_i)$.

Proof

Since $q(\vec{a}_i)$ is a ground query to the $\mathcal{AL}$-log knowledge base $\mathcal{B} = \mathcal{K} \cup \mathcal{A}_i \cup H$, the
thesis follows from Definition 10 and Lemma 1. □

Note that both coverage tests can be reduced to query answering. 
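
As an illustration of this reduction, the sketch below tests coverage under interpretations as a single query. It is a simplification under explicit assumptions: the Datalog part of the background knowledge is taken to be already materialised as ground facts, and the structural part is represented by a set `concept_members` of `(individual, concept)` pairs assumed to have been computed beforehand by a DL reasoner. Because each constraint is checked against that set within one derivation, the sketch is sound but does not capture refutations that combine several derivations across models (Definition 2). The concept name `MiddleEastCountry` and the facts about Iran come from the running example of this paper; everything else is hypothetical.

```python
from typing import NamedTuple

class Var(NamedTuple):
    name: str

def match(atom, fact, theta):
    """Extend theta so that atom under theta equals fact, or return None."""
    (pred, args), (fpred, fargs) = atom, fact
    if pred != fpred or len(args) != len(fargs):
        return None
    theta = dict(theta)
    for a, f in zip(args, fargs):
        if isinstance(a, Var):
            if theta.setdefault(a, f) != f:
                return None
        elif a != f:
            return None
    return theta

def solutions(atoms, facts, theta):
    """Yield every extension of theta that maps all atoms into facts."""
    if not atoms:
        yield theta
        return
    for fact in facts:
        extended = match(atoms[0], fact, theta)
        if extended is not None:
            yield from solutions(atoms[1:], facts, extended)

def covers(hypothesis, observation, background_facts, concept_members):
    """Does the hypothesis derive q(a_i) from K, A_i and itself?"""
    head, body, constraints = hypothesis
    target, facts_i = observation
    start = match(head, target, {})          # bind the head to q(a_i)
    if start is None:
        return False
    facts = set(background_facts) | set(facts_i)
    for theta in solutions(list(body), facts, start):
        # Constraints t : C are looked up in the precomputed membership set.
        if all((theta.get(t, t), c) in concept_members for t, c in constraints):
            return True
    return False

X, Y, P = Var("X"), Var("Y"), Var("P")
observation_ir = (("q", ("IR",)), {
    ("language", ("IR", "Persian", 58)),
    ("religion", ("IR", "ShiaMuslim", 89)),
    ("religion", ("IR", "SunniMuslim", 10)),
})
members = {("IR", "MiddleEastCountry")}
h_religion = (("q", (X,)), [("religion", (X, Y, P))], [(X, "MiddleEastCountry")])
h_arabic = (("q", (X,)), [("language", (X, "Arabic", P))], [(X, "MiddleEastCountry")])
```

The same function could serve learning from entailment by passing $body(o_i)$ in place of $\mathcal{A}_i$, in line with Theorem 3.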

^5 $\vec{a}_i$ is a tuple of constants

4 Instantiating the framework for Ontology Refinement



Ontology Refinement is a phase in the Ontology Learning process that aims at the
adaptation of an existing ontology to a specific domain or the needs of a particular
user (Maedche and Staab 2004). In this section we consider the problem of Concept
Refinement which is about refining a known concept, called reference concept,
belonging to an existing taxonomic ontology in the light of new knowledge coming
from a relational data source. A taxonomic ontology is an ontology organized
around the is-a relationship between concepts (Gomez-Perez et al. 2004). We
assume that a concept $C$ consists of two parts: an intension $int(C)$ and an extension
$ext(C)$. The former is an expression belonging to a logical language $\mathcal{L}$ whereas the
latter is a set of objects that satisfy the former. More formally, given

• a taxonomic ontology $\Sigma$,

• a reference concept $C_{ref} \in \Sigma$,

• a relational data source $\Pi$,

• a logical language $\mathcal{L}$

the Concept Refinement problem is to find a taxonomy $\mathcal{G}$ of concepts $C_i$ such that
(i) $int(C_i) \in \mathcal{L}$ and (ii) $ext(C_i) \subset ext(C_{ref})$. Therefore $\mathcal{G}$ is structured according to
the subset relation between concept extensions. Note that $C_{ref}$ is among both the
concepts defined in $\Sigma$ and the symbols of $\mathcal{L}$. Furthermore $ext(C_i)$ relies on the notion
of satisfiability of $int(C_i)$ w.r.t. $\mathcal{B} = \Sigma \cup \Pi$. We would like to emphasize that $\mathcal{B}$
includes $\Sigma$ because in Ontology Refinement, as opposed to other forms of Ontology
Learning such as Ontology Extraction (or Building), it is mandatory to consider
the existing ontology and its existing connections. Thus, a formalism like $\mathcal{AL}$-log
suits very well the hybrid nature of $\mathcal{B}$ (see Section 4.1).



In our ILP approach the Ontology Refinement problem at hand is reformulated
as a Concept Formation problem (Lisi and Esposito 2007). Concept Formation
indicates a Machine Learning (ML) task that refers to the acquisition of conceptual
hierarchies in which each concept has a flexible, non-logical definition and in which
learning occurs incrementally and without supervision (Langley 1987). More precisely,
it is to take a large number of unlabeled training instances; to find clusterings that
group those instances in categories; to find an intensional definition for each category
that summarizes its instances; and to find a hierarchical organization for those
categories (Gennari et al. 1989). Concept Formation stems from Conceptual Clustering
(Michalski and Stepp 1983). The two differ substantially in the methods: the latter
usually applies bottom-up batch algorithms whereas the former prefers top-down
incremental ones. Yet the methods are similar in the scope of induction, i.e. prediction, as
opposed to (Statistical) Clustering (Hartigan 2001) and Frequent Pattern Discovery
(Mannila and Toivonen 1997) whose goal is to describe a data set. According
to the commonly accepted formulation of the task (Langley 1987; Gennari et al.
1989), Concept Formation can be decomposed in two sub-tasks:

1. clustering

2. characterization

The former consists of using internalised heuristics to organize the observations
into categories whereas the latter consists in determining a concept (that is, an
intensional description) for each extensionally defined subset discovered by
clustering. We propose a pattern-based approach for the former (see Section 4.2) and a
bias-based approach for the latter (see Section 4.3). In particular, the clustering
approach is pattern-based because it relies on the aforementioned commonalities
between Clustering and Frequent Pattern Discovery. Descriptive tasks fit the ILP
setting of characteristic induction (De Raedt and Dehaspe 1997). A distinguishing
feature of this form of induction is the density of solution space. The setting
of learning from interpretations has been shown to be a promising way of dealing
with such spaces (Blockeel et al. 1999).



Definition 11 

Let $\mathcal{L}$ be a hypothesis language, $\mathcal{K}$ a background knowledge, $O$ a set of observations,
and $M(\mathcal{B})$ a model constructed from $\mathcal{B} = \mathcal{K} \cup O$. The goal of characteristic induction
from interpretations is to find a set $\mathcal{H} \subseteq \mathcal{L}$ of hypotheses such that (i) $\mathcal{H}$ is true in
$M(\mathcal{B})$, and (ii) for each $H \in \mathcal{L}$, if $H$ is true in $M(\mathcal{B})$ then $\mathcal{H} \models H$.



In the following subsection we will clarify the nature of $\mathcal{K}$ and $O$.



4.1 Representation Choice

The KR&R framework for conceptual knowledge in the Concept Refinement problem
at hand is the one offered by $\mathcal{AL}$-log.

The taxonomic ontology $\Sigma$ is an $\mathcal{ALC}$ knowledge base. From now on we will
call input concepts all the concepts occurring in $\Sigma$.

Example 1

Throughout this Section, we will refer to the $\mathcal{ALC}$ ontology $\Sigma_{CIA}$ (see
Figure 2) concerning countries, ethnic groups, languages, and religions of the world, and
built according to Wikipedia taxonomies. For instance, the expression

MiddleEastCountry ≡ AsianCountry ⊓ ∃Hosts.MiddleEasternEthnicGroup.

is an equivalence axiom that defines the concept MiddleEastCountry as an Asian
country which hosts at least one Middle Eastern ethnic group. In particular, Armenia
('ARM') and Iran ('IR') are two of the 15 countries that are classified as
Middle Eastern.

The relational data source $\Pi$ is a Datalog program. The extensional part of
$\Pi$ is partitioned into portions $A_i$, each of which refers to an individual $a_i$ of
$C_{ref}$. The link between $A_i$ and $a_i$ is represented with the Datalog literal $q(a_i)$.
The pair $(q(a_i), A_i)$ is called observation. The intensional part (IDB) of $\Pi$ together
with the whole $\Sigma$ is considered as background knowledge for the problem at hand.

http://www.wikipedia.org/

Figure 2. The ontology used as an example throughout Section 4.1



Example 2

An $\mathcal{AL}$-log knowledge base $\mathcal{B}_{CIA}$ has been obtained by integrating
$\Sigma_{CIA}$ and a Datalog program $\Pi_{CIA}$ based on the on-line 1996 CIA World Fact
Book. The extensional part of $\Pi_{CIA}$ consists of Datalog facts grouped according to the
individuals of MiddleEastCountry. In particular, the observation $(q('IR'), A_{IR})$
contains Datalog facts such as

language('IR','Persian',58).
religion('IR','ShiaMuslim',89).
religion('IR','SunniMuslim',10).

concerning the individual 'IR'. The intensional part of $\Pi_{CIA}$ defines two views on
language and religion:

speaks(CountryID, LanguageN) ← language(CountryID, LanguageN, Perc)
    & CountryID:Country, LanguageN:Language.

believes(CountryID, ReligionN) ← religion(CountryID, ReligionN, Perc)
    & CountryID:Country, ReligionN:Religion.

that can deduce new Datalog facts when triggered on $\mathcal{B}_{CIA}$. It forms the
background knowledge $\mathcal{K}_{CIA}$ together with the whole $\Sigma_{CIA}$.
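The two views of Example 2 can be read as projections guarded by concept membership. The following Python sketch is only an illustration of that reading, not the way $\mathcal{AL}$-log or a Datalog engine actually evaluates constrained clauses. It derives `speaks` and `believes` facts from the three facts listed for 'IR'. The sets `COUNTRY`, `LANGUAGE` and `RELIGION` and the helper `view` are hypothetical stand-ins for what the ontology would entail about concept membership.

```python
# Extensional facts for 'IR' as listed in Example 2: (country, name, percentage).
language = {("IR", "Persian", 58)}
religion = {("IR", "ShiaMuslim", 89), ("IR", "SunniMuslim", 10)}

# Assumption: concept membership is given as plain sets. In AL-log these
# memberships would be entailed by the ontology, not enumerated by hand.
COUNTRY = {"IR"}
LANGUAGE = {"Persian"}
RELIGION = {"ShiaMuslim", "SunniMuslim"}


def view(facts, first_concept, second_concept):
    """Project away the percentage and keep pairs satisfying both constraints."""
    return {
        (x, y)
        for (x, y, _perc) in facts
        if x in first_concept and y in second_concept
    }


speaks = view(language, COUNTRY, LANGUAGE)
believes = view(religion, COUNTRY, RELIGION)
```

Under these assumptions `speaks` contains the single pair for Persian and `believes` contains one pair per religion fact.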



http://www.odci.gov/cia/publications/factbook/
http://www.dbis.informatik.uni-goettingen.de/Mondial/mondial-rel-facts.flp

The language $\mathcal{L}$ contains expressions, called O-queries, relating individuals of
$C_{ref}$ to individuals of other input concepts (task-relevant concepts). An O-query is
a constrained Datalog clause of the form

$Q = q(X) \leftarrow \alpha_1, \ldots, \alpha_m \,\&\, X : C_{ref}, \gamma_2, \ldots, \gamma_n$

which is compliant with the properties mentioned in Section 3.1. The O-query

$Q_t = q(X) \leftarrow \,\&\, X : C_{ref}$

is called trivial for $\mathcal{L}$ because it only contains the constraint for the
distinguished variable $X$. Furthermore, the language $\mathcal{L}$ is multi-grained, i.e. it
contains expressions at multiple levels of description granularity. Indeed it is implicitly
defined by a declarative bias specification which consists of a finite alphabet $\mathcal{A}$
of Datalog predicate names appearing in $\Pi$ and finite alphabets $\Gamma^l$ (one for each
level $l$ of description granularity) of $\mathcal{ALC}$ concept names occurring in $\Sigma$.
Note that the $\alpha_i$'s are taken from $\mathcal{A}$ and the $\gamma_j$'s are taken from
$\Gamma^l$. We impose $\mathcal{L}$ to be finite by specifying some bounds, mainly maxD for
the maximum depth of search and maxG for the maximum level of granularity.

Example 3

We want to refine the concept MiddleEastCountry belonging to $\Sigma_{CIA}$ in the light of
the new knowledge coming from $\Pi_{CIA}$. More precisely, we want to describe Middle
East countries (individuals of the reference concept) with respect to the religions
believed and the languages spoken (individuals of the task-relevant concepts) at
three levels of granularity (maxG = 3). To this aim we define $\mathcal{L}_{CIA}$ as the set
of O-queries with $C_{ref}$ = MiddleEastCountry that can be generated from the alphabet

$\mathcal{A}$ = {believes/2, speaks/2}

of Datalog binary predicate names, and the alphabets

$\Gamma^1$ = {Language, Religion}
$\Gamma^2$ = {IndoEuropeanLanguage, ..., MonotheisticReligion, ...}
$\Gamma^3$ = {IndoIranianLanguage, ..., MuslimReligion, ...}

of $\mathcal{ALC}$ concept names for $1 \leq l \leq 3$, up to maxD = 5. Note that the names
in $\mathcal{A}$ are taken from $\Pi_{CIA}$ whereas the names in the $\Gamma^l$'s are taken
from $\Sigma_{CIA}$. Examples of O-queries in $\mathcal{L}_{CIA}$ are:

$Q_t$ = q(X) ← & X:MiddleEastCountry

$Q_1$ = q(X) ← speaks(X,Y) & X:MiddleEastCountry, Y:Language

$Q_2$ = q(X) ← speaks(X,Y) & X:MiddleEastCountry, Y:IndoEuropeanLanguage

$Q_3$ = q(X) ← believes(X,Y) & X:MiddleEastCountry, Y:MuslimReligion

where $Q_t$ is the trivial O-query for $\mathcal{L}_{CIA}$, $Q_1 \in \mathcal{L}^1_{CIA}$,
$Q_2 \in \mathcal{L}^2_{CIA}$, and $Q_3 \in \mathcal{L}^3_{CIA}$.

Output concepts are the concepts automatically formed out of the input ones
by taking into account the relational data source. Thus, an output concept $C$ has
an O-query $Q \in \mathcal{L}$ as intension and the set $answerset(Q, \mathcal{B})$ of correct
answers to $Q$ w.r.t. $\mathcal{B}$ as extension. Note that this set contains the
substitutions $\theta_i$'s for the distinguished variable of $Q$ such that there exists a
correct answer to $body(Q)\theta_i$ w.r.t. $\mathcal{B}$. In other words, the extension is the
set of individuals of $C_{ref}$ satisfying the intension. Also with reference to Section 3.3,
note that proving that an O-query $Q$ covers an observation $(q(a_i), A_i)$ w.r.t.
$\mathcal{K}$ amounts to proving that $\theta_i = \{X/a_i\}$ is a correct answer to $Q$ w.r.t.
$\mathcal{B}_i = \mathcal{K} \cup A_i$.

Example 4

The output concept having $Q_1$ as intension has extension $answerset(Q_1, \mathcal{B}_{CIA})$ =
{'ARM', 'IR', 'SA', 'UAE'}. In particular, $Q_1$ covers the observation $(q('IR'), A_{IR})$
w.r.t. $\mathcal{K}_{CIA}$. This coverage test is equivalent to answering the query
← q('IR') w.r.t. $\mathcal{K}_{CIA} \cup A_{IR} \cup Q_1$.
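To make the notions of answer set and coverage more concrete, the following Python sketch evaluates an O-query by naive backtracking over ground facts. It is a simplification under stated assumptions: concept membership is supplied as explicit sets (the `members` dictionary) instead of being derived by ontological reasoning, no object identity bias is enforced, and the facts are toy data invented for illustration, not the contents of $\mathcal{B}_{CIA}$. The function name `answer_set` and the query encoding are likewise hypothetical.

```python
def answer_set(body, constraints, facts, members, distinguished="X"):
    """Return the bindings of the distinguished variable that satisfy the query.

    body        -- list of (predicate, (var, var, ...)) atoms
    constraints -- dict mapping a variable to a concept name
    facts       -- dict mapping a predicate to a set of tuples of constants
    members     -- dict mapping a concept name to the set of its individuals
    """
    def extend(bindings, atoms):
        # Depth-first search for one consistent binding of all body variables.
        if not atoms:
            yield bindings
            return
        (pred, args), rest = atoms[0], atoms[1:]
        for row in facts.get(pred, ()):
            new = dict(bindings)
            ok = True
            for var, const in zip(args, row):
                if new.setdefault(var, const) != const:
                    ok = False      # clashes with an earlier binding
                    break
                if var in constraints and const not in members[constraints[var]]:
                    ok = False      # violates the ontological constraint
                    break
            if ok:
                yield from extend(new, rest)

    answers = set()
    for candidate in members[constraints[distinguished]]:
        if any(True for _ in extend({distinguished: candidate}, body)):
            answers.add(candidate)
    return answers


# Toy data (hypothetical, for illustration only).
facts = {"speaks": {("IR", "Persian"), ("SA", "Arabic"), ("FR", "French")}}
members = {
    "MiddleEastCountry": {"IR", "SA", "YE"},
    "Language": {"Persian", "Arabic", "French"},
    "IndoEuropeanLanguage": {"Persian", "French"},
}
q1 = ([("speaks", ("X", "Y"))], {"X": "MiddleEastCountry", "Y": "Language"})
q2 = ([("speaks", ("X", "Y"))],
      {"X": "MiddleEastCountry", "Y": "IndoEuropeanLanguage"})

ext_q1 = answer_set(*q1, facts, members)
ext_q2 = answer_set(*q2, facts, members)
covers_ir = "IR" in ext_q1   # coverage of the observation for 'IR'
```

On this toy data the extension of the more specific query is a subset of the extension of the more general one, which is the property the taxonomy of output concepts relies on.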

Output concepts are organized into a taxonomy $\mathcal{G}$ rooted in $C_{ref}$ and
structured as a Directed Acyclic Graph (DAG) according to the subset relation between
concept extensions. Note that such an ordering is in line with the set-theoretic
semantics of the subsumption relation in ontology languages (see, e.g., the semantics
of $\sqsubseteq$ in $\mathcal{ALC}$).



4.2 Pattern-based clustering



Frequent Pattern Discovery is about the discovery of regularities in a data set
(Mannila and Toivonen 1997). A frequent pattern is an intensional description,
expressed in a language $\mathcal{L}$, of a subset of a given data set $r$ whose cardinality
exceeds a user-defined threshold (minimum support). Note that patterns can refer
to multiple levels of description granularity (multi-grained patterns) (Han and Fu
1999). Here $r$ typically encompasses a taxonomy $\mathcal{T}$. More precisely, the problem of
frequent pattern discovery at $l$ levels of description granularity, $1 \leq l \leq maxG$, is
to find the set of all the frequent patterns expressible in a multi-grained language
$\mathcal{L} = \{\mathcal{L}^l\}_{1 \leq l \leq maxG}$ and evaluated against $r$ w.r.t. a set
$\{minsup^l\}_{1 \leq l \leq maxG}$ of minimum support thresholds by means of the evaluation
function $supp$. In this case, $P \in \mathcal{L}^l$ with support $s$ is frequent in $r$ if
(i) $s \geq minsup^l$ and (ii) all ancestors of $P$ w.r.t. $\mathcal{T}$ are frequent in $r$.
The blueprint of most algorithms for frequent pattern discovery is the levelwise search
method (Mannila and Toivonen 1997), which searches the space $(\mathcal{L}, \succeq)$ of
patterns organized according to a generality order $\succeq$ in a breadth-first manner,
starting from the most general pattern in $\mathcal{L}$ and alternating candidate generation
and candidate evaluation phases. The underlying assumption is that $\succeq$ is a quasi-order
monotonic w.r.t. $supp$. Note that the method proposed in (Mannila and Toivonen 1997) is
also at the basis of algorithms for the variant of the task defined in (Han and Fu 1999).
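The alternation of candidate generation and candidate evaluation can be sketched on a deliberately simpler pattern language. The Python code below runs a levelwise search over itemsets ordered by set inclusion, which is a stand-in chosen for brevity and not the O-query language of this paper. The function name `levelwise`, the toy transactions and the choice of starting from singleton itemsets are assumptions of this sketch.

```python
from itertools import combinations


def levelwise(transactions, minsup):
    """Breadth-first search for frequent itemsets with monotonicity pruning."""
    n = len(transactions)

    def supp(pattern):
        # Fraction of transactions that contain every item of the pattern.
        return sum(pattern <= t for t in transactions) / n

    items = sorted(set().union(*transactions))
    frequent = {}
    # Start from the most general non-empty patterns (single items).
    level = [frozenset([i]) for i in items]
    while level:
        size = len(level[0]) + 1
        # Candidate evaluation phase.
        survivors = [p for p in level if supp(p) >= minsup]
        frequent.update({p: supp(p) for p in survivors})
        # Candidate generation phase: join survivors, then keep only the
        # candidates whose immediate generalizations are all frequent.
        candidates = {a | b for a, b in combinations(survivors, 2)
                      if len(a | b) == size}
        level = [
            c for c in candidates
            if all(frozenset(s) in frequent for s in combinations(c, size - 1))
        ]
    return frequent


transactions = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b"}]
result = levelwise(transactions, minsup=0.5)
```

The pruning step is sound only because support cannot increase when a pattern is specialized, which is the monotonicity assumption stated in the text.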



A frequent pattern highlights a regularity in $r$, therefore it can be considered
as the clue of a data cluster. Note that clusters are concepts partially specified
(called emerging concepts): only the extension is known. We propose to detect
emerging concepts by applying the method of (Lisi and Malerba 2004) for frequent
pattern discovery at $l$, $1 \leq l \leq maxG$, levels of description granularity and $k$,
$1 \leq k \leq maxD$, levels of search depth. It adapts (Mannila and Toivonen 1997; Han
and Fu 1999) to the KR&R framework of $\mathcal{AL}$-log as follows. For $\mathcal{L}$ being a
multi-grained language of O-queries, we need to define first $supp$, then $\succeq$. The
support of an O-query $Q \in \mathcal{L}$ w.r.t. an $\mathcal{AL}$-log knowledge base
$\mathcal{B}$ is defined as

$supp(Q, \mathcal{B}) = |answerset(Q, \mathcal{B})| \,/\, |answerset(Q_t, \mathcal{B})|$

and supplies the percentage of individuals of $C_{ref}$ that satisfy $Q$.

Example 5

Since $|answerset(Q_t, \mathcal{B}_{CIA})|$ = |MiddleEastCountry| = 15, then
$supp(Q_1, \mathcal{B}_{CIA})$ = 4/15 = 26.6%.
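The arithmetic of Example 5 can be reproduced in a few lines of Python. This is a minimal sketch: the helper name `supp` mirrors the evaluation function of the text, while the representation of answer sets as Python sets is an assumption made for illustration. The country codes are those reported in Example 4 and in the concept listings of Section 4.4.

```python
from fractions import Fraction

# Extension of the trivial O-query Q_t: the 15 individuals of MiddleEastCountry.
ANSWERS_QT = {"ARM", "BRN", "IR", "IRQ", "IL", "JOR", "KWT", "RL",
              "OM", "Q", "SA", "SYR", "TR", "UAE", "YE"}
# Extension of Q_1 as reported in Example 4.
ANSWERS_Q1 = {"ARM", "IR", "SA", "UAE"}


def supp(answers_q, answers_qt):
    """Fraction of the individuals of the reference concept that satisfy Q."""
    return Fraction(len(answers_q), len(answers_qt))


support_q1 = supp(ANSWERS_Q1, ANSWERS_QT)
print(support_q1, f"{float(support_q1):.2%}")
```

The exact value is 4/15, i.e. 26.6 recurring percent, which is above the 20% threshold used for the first granularity level in Section 4.4.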

Being a special case of constrained Datalog clauses, O-queries can be ordered
according to the $\mathcal{B}$-subsumption relation introduced in Section 3.2. It has been
proved that $\succeq_{\mathcal{B}}$ is a quasi-order that fulfills the condition of
monotonicity w.r.t. $supp$ (Lisi and Malerba 2004). Also note that the underlying reasoning
mechanism of $\mathcal{AL}$-log makes $\mathcal{B}$-subsumption more powerful than generalized
subsumption, as illustrated in the following example.

Example 6

It can be checked that $Q_1 \succeq_{\mathcal{B}} Q_2$ by choosing $\sigma$ = {X/a, Y/b} as a
Skolem substitution for $Q_2$ w.r.t. $\mathcal{B}_{CIA} \cup \{Q_1\}$ and $\theta = \emptyset$
as a substitution for $Q_1$. Similarly it can be proved that
$Q_2 \not\succeq_{\mathcal{B}} Q_1$. Furthermore, it can be easily verified that $Q_3$
$\mathcal{B}$-subsumes the following O-query in $\mathcal{L}^3_{CIA}$

$Q_4$ = q(A) ← believes(A,B), believes(A,C) &
        A:MiddleEastCountry, B:MuslimReligion

by choosing $\sigma$ = {A/a, B/b, C/c} as a Skolem substitution for $Q_4$ w.r.t.
$\mathcal{B}_{CIA} \cup \{Q_3\}$ and $\theta$ = {X/A, Y/B} as a substitution for $Q_3$. Note
that $Q_4 \not\succeq_{\mathcal{B}} Q_3$ under the OI bias. Indeed this bias does not admit the
substitution {A/X, B/Y, C/Y} for $Q_4$ which would make it possible to verify conditions (i)
and (ii) of the $\succeq_{\mathcal{B}}$ test.
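The reason why the last substitution of Example 6 is rejected can be checked mechanically. The sketch below assumes the usual reading of the object identity (OI) bias, under which distinct terms of a clause must stay distinct after a substitution is applied, so an admissible substitution has to be injective. The function name `respects_object_identity` and the dictionary encoding of substitutions are hypothetical.

```python
def respects_object_identity(substitution):
    """True if no two variables are mapped to the same term (injectivity)."""
    images = list(substitution.values())
    return len(images) == len(set(images))


# Substitution {A/X, B/Y, C/Y}: both B and C are mapped to Y.
rejected = {"A": "X", "B": "Y", "C": "Y"}
# Substitution {X/A, Y/B}: every variable has its own image.
admitted = {"X": "A", "Y": "B"}

print(respects_object_identity(rejected), respects_object_identity(admitted))
```

The first substitution collapses two variables onto one term and is therefore not admitted, whereas the second one is.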

We would like to emphasize that $\Sigma$, besides contributing to the definition of
$\mathcal{L}$ (see Section 4.1), plays a key role in the $\succeq_{\mathcal{B}}$ test.



4.3 Bias-based characterization



Since several frequent patterns can have the same set of supporting individuals,
turning clusters into concepts is crucial in our approach. Biases can be of help. A
bias concerns anything which constrains the search for theories (Utgoff and Mitchell
1982). In ILP, language bias has to do with constraints on the clauses in the search
space whereas search bias has to do with the way a system searches its space of
permitted clauses (Nedellec et al. 1996). The choice criterion for concept intensions
has been obtained by combining two orthogonal biases: a language bias and
a search bias (Lisi and Esposito 2006). The former allows the user to define conditions
on the form of O-queries to be accepted as concept intensions. E.g., it is
possible to state which is the minimum level of description granularity (parameter
minG) and whether (all) the variables must be ontologically constrained or not.
The latter allows the user to define a preference criterion based on
$\mathcal{B}$-subsumption. More precisely, it is possible to state whether the most general
description (m.g.d.) or the most specific description (m.s.d.) w.r.t.
$\succeq_{\mathcal{B}}$ has to be preferred. Since $\succeq_{\mathcal{B}}$ is not a total
order, it can happen that two patterns $P$ and $Q$, belonging to the same language
$\mathcal{L}$, cannot be compared w.r.t. $\succeq_{\mathcal{B}}$. In this case, the m.g.d.
(resp. m.s.d.) of $P$ and $Q$ is the union (resp. conjunction) of $P$ and $Q$.

Example 7

The patterns

q(A) ← speaks(A,B), believes(A,C) & A:MiddleEastCountry, B:ArabicLanguage

and

q(A) ← believes(A,B), speaks(A,C) & A:MiddleEastCountry, B:MuslimReligion

have the same answer set {'ARM', 'IR'} but are incomparable w.r.t.
$\succeq_{\mathcal{B}}$. Their m.g.d. is the union of the two:

q(A) ← speaks(A,B), believes(A,C) & A:MiddleEastCountry, B:ArabicLanguage
q(A) ← believes(A,B), speaks(A,C) & A:MiddleEastCountry, B:MuslimReligion

Their m.s.d. is the conjunction of the two:

q(A) ← believes(A,B), speaks(A,C), speaks(A,D), believes(A,E) &
       A:MiddleEastCountry, B:MuslimReligion, C:ArabicLanguage

The extension of the subsequent concept will be {'ARM', 'IR'}.

The two biases are combined as follows. For each frequent pattern $P \in \mathcal{L}$ that
fulfills the language bias specification, the procedure for building the taxonomy
$\mathcal{G}$ from the set $\mathcal{F} = \{\mathcal{F}^l_k \mid 1 \leq l \leq maxG, 1 \leq k \leq maxD\}$
checks whether a concept $C$ with $ext(C) = answerset(P)$ already exists in $\mathcal{G}$. If
no such concept is retrieved, a new node $C$ with $int(C) = P$ and $ext(C) = answerset(P)$
is added to $\mathcal{G}$. Note that the insertion of a node can imply the reorganization of
$\mathcal{G}$ to keep it compliant with the subset relation on extents. If the node already
occurs in $\mathcal{G}$, its intension is updated according to the search bias specification.
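The procedure just described can be sketched in Python as follows. This is an illustration under several assumptions and not the implementation used in the experiments: intensions are opaque labels, the language bias is passed as a predicate `accept`, the search bias is passed as a function `prefer` that chooses between two intensions, and the DAG is recomputed at the end from the proper subset relation instead of being reorganized at each insertion. All names are hypothetical.

```python
def build_taxonomy(patterns, accept, prefer):
    """Group accepted patterns by extension and pick one intension per group.

    patterns -- iterable of (intension, extension) pairs, extension a frozenset
    accept   -- language bias: predicate on intensions
    prefer   -- search bias: function (old_intension, new_intension) -> intension
    """
    concepts = {}                       # extension -> chosen intension
    for intension, extension in patterns:
        if not accept(intension):
            continue                    # rejected by the language bias
        if extension not in concepts:
            concepts[extension] = intension
        else:
            concepts[extension] = prefer(concepts[extension], intension)
    # Direct edges of the DAG: proper subset relation between extensions with
    # no other extension strictly in between.
    edges = set()
    for child in concepts:
        for parent in concepts:
            if child < parent and not any(child < mid < parent for mid in concepts):
                edges.add((parent, child))
    return concepts, edges


ALL = frozenset({"ARM", "BRN", "IR", "IRQ", "IL", "JOR", "KWT", "RL",
                 "OM", "Q", "SA", "SYR", "TR", "UAE", "YE"})
MUSLIM = ALL - {"ARM", "YE"}
ARM_IR = frozenset({"ARM", "IR"})

patterns = [
    ("trivial", ALL),
    ("believes_muslim", MUSLIM),
    ("speaks_indoeuropean", ARM_IR),
    ("speaks_indoiranian", ARM_IR),     # same extension as the previous pattern
    ("unconstrained", frozenset({"IR"})),
]
concepts, edges = build_taxonomy(
    patterns,
    accept=lambda intension: intension != "unconstrained",
    prefer=lambda old, new: old,        # placeholder for the m.g.d./m.s.d. choice
)
```

In this sketch the two patterns sharing the extension {ARM, IR} collapse into a single node, and the `prefer` placeholder decides which intension survives. A faithful version would compare the intensions w.r.t. $\mathcal{B}$-subsumption.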



4.4 Experimental Results



In order to test the approach we have extended the ILP system $\mathcal{AL}$-QuIn (Lisi
2006) with a module for post-processing frequent patterns into concepts. The goal
of the experiments is to provide empirical evidence of the orthogonality of the
two biases and of the potential of their combination as choice criterion. The results
reported in the following are obtained for the problem introduced in Example 3 by
setting the parameters for the frequent pattern discovery phase as follows: maxD =
5, maxG = 3, $minsup^1$ = 20%, $minsup^2$ = 13%, and $minsup^3$ = 10%. Thus each
experiment starts from the same set $\mathcal{F}$ of 53 frequent patterns out of 99 candidate
patterns. Also all the experiments require the descriptions to have all the variables
ontologically constrained but vary as to the user preferences for the minimum level
of description granularity (minG) and the search bias (m.g.d./m.s.d.).



$\mathcal{AL}$-QuIn is implemented with Prolog but equipped with a module for pre-processing
OWL ontologies in order to enable Semantic Web Mining applications.

Figure 3. Taxonomy $\mathcal{G}_{CIA}$ obtained with the m.g.d. criterion for minG = 2.



The first two experiments both require the descriptions to have all the variables
ontologically constrained by concepts from the second granularity level on
(minG = 2). When the m.g.d. criterion is adopted, the procedure of taxonomy
building returns the following twelve concepts:

C-1111 ∈ $\mathcal{F}^1_1$
q(A) ← A:MiddleEastCountry
{ARM, BRN, IR, IRQ, IL, JOR, KWT, RL, OM, Q, SA, SYR, TR, UAE, YE}

C-5233 ∈ $\mathcal{F}^2_3$
q(A) ← believes(A,B) & A:MiddleEastCountry, B:MonotheisticReligion
{ARM, BRN, IR, IRQ, IL, JOR, KWT, RL, OM, Q, SA, SYR, TR, UAE}

C-2233 ∈ $\mathcal{F}^2_3$
q(A) ← speaks(A,B) & A:MiddleEastCountry, B:AfroAsiaticLanguage
{IR, SA, YE}

C-3233 ∈ $\mathcal{F}^2_3$
q(A) ← speaks(A,B) & A:MiddleEastCountry, B:IndoEuropeanLanguage
{ARM, IR}

C-8256 ∈ $\mathcal{F}^2_5$
q(A) ← speaks(A,B), believes(A,C) &
       A:MiddleEastCountry, B:AfroAsiaticLanguage, C:MonotheisticReligion
{IR, SA}

C-6256 ∈ $\mathcal{F}^2_5$
q(A) ← believes(A,B), believes(A,C) &
       A:MiddleEastCountry, B:MonotheisticReligion, C:MonotheisticReligion
{BRN, IR, IRQ, IL, JOR, RL, SYR}

C-2333 ∈ $\mathcal{F}^3_3$
q(A) ← believes(A,'Druze') & A:MiddleEastCountry
{IL, SYR}

C-3333 ∈ $\mathcal{F}^3_3$
q(A) ← believes(A,B) & A:MiddleEastCountry, B:JewishReligion
{IR, IL, SYR}

C-4333 ∈ $\mathcal{F}^3_3$
q(A) ← believes(A,B) & A:MiddleEastCountry, B:ChristianReligion
{ARM, IR, IRQ, IL, JOR, RL, SYR}

C-5333 ∈ $\mathcal{F}^3_3$
q(A) ← believes(A,B) & A:MiddleEastCountry, B:MuslimReligion
{BRN, IR, IRQ, IL, JOR, KWT, RL, OM, Q, SA, SYR, TR, UAE}

C-14356 ∈ $\mathcal{F}^3_5$
q(A) ← believes(A,B), believes(A,C) &
       A:MiddleEastCountry, B:ChristianReligion, C:MuslimReligion
{IR, IRQ, IL, JOR, RL, SYR}

C-5356 ∈ $\mathcal{F}^3_5$
q(A) ← believes(A,B), believes(A,C) &
       A:MiddleEastCountry, B:MuslimReligion, C:MuslimReligion
{BRN, IR, SYR}

organized in the DAG $\mathcal{G}_{CIA}$ (see Figure 3). They are numbered according to the
chronological order of insertion in $\mathcal{G}_{CIA}$ and annotated with information of the
generation step. From a qualitative point of view, concepts C-2233 and C-5333
well characterize Middle East countries. Armenia (ARM), as opposed to Iran (IR),
does not fall in these concepts. It rather belongs to the weaker characterizations
C-3233 and C-4333. This suggests that our procedure performs a 'sensible' clustering.
Indeed Armenia is a well-known borderline case for the geo-political concept of
Middle East, though Armenians are usually listed among Middle Eastern ethnic
groups. Modern experts tend nowadays to consider it as part of Europe, therefore
out of Middle East. But in 1996 the on-line CIA World Fact Book still considered
Armenia as part of Asia.

C-2233 is less populated than expected because $\mathcal{B}_{CIA}$ does not provide facts on
the languages spoken for all countries.

Figure 4. Taxonomy $\mathcal{G}'_{CIA}$ obtained with the m.s.d. criterion for minG = 2.

When the m.s.d. criterion is adopted (see Figure 4), the intensions for the concepts
C-2233, C-3233, C-8256, C-2333 and C-3333 change as follows:

C-2233 ∈ $\mathcal{F}^2_3$
q(A) ← speaks(A,B) & A:MiddleEastCountry, B:ArabicLanguage
{IR, SA, YE}

C-3233 ∈ $\mathcal{F}^2_3$
q(A) ← speaks(A,B) & A:MiddleEastCountry, B:IndoIranianLanguage
{ARM, IR}

C-8256 ∈ $\mathcal{F}^2_5$
q(A) ← speaks(A,B), believes(A,C) &
       A:MiddleEastCountry, B:ArabicLanguage, C:MuslimReligion
{IR, SA}

C-2333 ∈ $\mathcal{F}^3_3$
q(A) ← believes(A,'Druze'), believes(A,B), believes(A,C), believes(A,D) &
       A:MiddleEastCountry, B:JewishReligion,
       C:ChristianReligion, D:MuslimReligion
{IL, SYR}

C-3333 ∈ $\mathcal{F}^3_3$
q(A) ← believes(A,B), believes(A,C), believes(A,D) &
       A:MiddleEastCountry, B:JewishReligion,
       C:ChristianReligion, D:MuslimReligion
{IR, IL, SYR}

In particular C-2333 and C-3333 look quite overfitted to data. Yet overfitting allows
us to realize that what distinguishes Israel (IL) and Syria (SYR) from Iran is just
the presence of Druze people. Note that the clusters do not change because the
search bias only affects the characterization step.

The other two experiments further restrict the conditions of the language bias
specification. Here only descriptions with variables constrained by concepts of
granularity from the third level on (minG = 3) are considered. When the m.g.d. option
is selected, the procedure for taxonomy building returns the following nine concepts:

C-1111 ∈ $\mathcal{F}^1_1$
q(A) ← A:MiddleEastCountry
{ARM, BRN, IR, IRQ, IL, JOR, KWT, RL, OM, Q, SA, SYR, TR, UAE, YE}

C-9333 ∈ $\mathcal{F}^3_3$
q(A) ← speaks(A,B) & A:MiddleEastCountry, B:ArabicLanguage
{IR, SA, YE}

C-2333 ∈ $\mathcal{F}^3_3$
q(A) ← believes(A,'Druze') & A:MiddleEastCountry
{IL, SYR}

C-3333 ∈ $\mathcal{F}^3_3$
q(A) ← believes(A,B) & A:MiddleEastCountry, B:JewishReligion
{IR, IL, SYR}

C-4333 ∈ $\mathcal{F}^3_3$
q(A) ← believes(A,B) & A:MiddleEastCountry, B:ChristianReligion
{ARM, IR, IRQ, IL, JOR, RL, SYR}

C-5333 ∈ $\mathcal{F}^3_3$
q(A) ← believes(A,B) & A:MiddleEastCountry, B:MuslimReligion
{BRN, IR, IRQ, IL, JOR, KWT, RL, OM, Q, SA, SYR, TR, UAE}

C-33356 ∈ $\mathcal{F}^3_5$
q(A) ← speaks(A,B), believes(A,C) &
       A:MiddleEastCountry, B:ArabicLanguage, C:MuslimReligion
{IR, SA}

Figure 5. Taxonomy $\mathcal{G}''_{CIA}$ obtained with the m.g.d. criterion for minG = 3.

C-14356 ∈ $\mathcal{F}^3_5$
q(A) ← believes(A,B), believes(A,C) &
       A:MiddleEastCountry, B:ChristianReligion, C:MuslimReligion
{IR, IRQ, IL, JOR, RL, SYR}

C-5356 ∈ $\mathcal{F}^3_5$
q(A) ← believes(A,B), believes(A,C) &
       A:MiddleEastCountry, B:MuslimReligion, C:MuslimReligion
{BRN, IR, SYR}

organized in a DAG $\mathcal{G}''_{CIA}$ (see Figure 5) which partially reproduces
$\mathcal{G}_{CIA}$. Note that the stricter conditions set in the language bias cause three
concepts occurring in $\mathcal{G}_{CIA}$ not to appear in $\mathcal{G}''_{CIA}$: the scarcely
significant C-5233 and C-6256, and the quite interesting C-3233. Therefore the language bias
can prune the space of clusters. Note that the other concepts of $\mathcal{G}_{CIA}$ emerged
at $l = 2$ do remain in $\mathcal{G}''_{CIA}$ as clusters but with a different
characterization: C-9333 and C-33356 instead of C-2233 and C-8256, respectively.

When the m.s.d. condition is chosen (see Figure 6), the intensions for the concepts
C-2333 and C-3333 change analogously to $\mathcal{G}'_{CIA}$. Note that both
$\mathcal{G}''_{CIA}$ and $\mathcal{G}'''_{CIA}$ are hierarchical taxonomies. It can be
empirically observed that the possibility of producing a hierarchy increases as the
conditions of the language bias become stricter.

Figure 6. Taxonomy $\mathcal{G}'''_{CIA}$ obtained with the m.s.d. criterion for minG = 3.



5 Conclusions 



Building rules on top of ontologies for the Semantic Web is a task that can be
automated by applying Machine Learning algorithms to data expressed with hybrid
formalisms combining DLs and Horn clauses. Learning in DL-based hybrid
languages has very recently attracted attention in the ILP community. In (Rouveirol
and Ventos 2000) the chosen language is Carin-$\mathcal{ALN}$, therefore example
coverage and subsumption between two hypotheses are based on the existential
entailment algorithm of Carin (Levy and Rousset 1998). Following (Rouveirol and
Ventos 2000), Kietz studies the learnability of Carin-$\mathcal{ALN}$, thus providing a
pre-processing method which enables ILP systems to learn Carin-$\mathcal{ALN}$ rules (Kietz
2003). Closely related to DL-based hybrid systems are the proposals arising from
the study of many-sorted logics, where a first-order language is combined with a
sort language which can be regarded as an elementary DL (Frisch 1991). In this
respect the study of a sorted downward refinement (Frisch 1999) can also be considered
a contribution to learning in hybrid languages. In this paper we have proposed
a general framework for learning in $\mathcal{AL}$-log. We would like to emphasize that the
DL-safeness and the decidability of $\mathcal{AL}$-log are two desirable properties which are
particularly appreciated both in ILP and in the Semantic Web application area.

As an instantiation of the framework we have considered the case of characteristic
induction from interpretations, more precisely the task of Frequent Pattern
Discovery, and an application to Ontology Refinement. The specific problem at
hand takes an ontology as input and returns subconcepts of one of the concepts
in the ontology. A distinguishing feature of our setting for this problem is that
the intensions of these subconcepts are in the form of rules that are automatically
built by discovering strong associations between concepts in the input ontology.
The idea of resorting to Frequent Pattern Discovery in Ontology Learning has
already been investigated in (Maedche and Staab 2000). Yet there are several differences
between (Maedche and Staab 2000) and the present work: (Maedche and Staab
2000) is conceived for Ontology Extraction instead of Ontology Refinement, uses
generalized association patterns (bottom-up search) instead of multi-level association
patterns (top-down search), and adopts propositional logic instead of FOL. Within
the same application area, (Maedche and Zacharias 2002) proposes a distance-based
method for clustering in RDF which is not conceptual. Also the relation between
Frequent Pattern Discovery and Concept Formation as such has never been
investigated. Rather our pattern-based approach to clustering is inspired by (Xiong
et al. 2005). Some contact points can also be found with (Zimmermann and Raedt
2004), which defines the problem of cluster-grouping and a solution to it that
integrates Subgroup Discovery, Correlated Pattern Mining and Conceptual Clustering.
Note that neither (Xiong et al. 2005) nor (Zimmermann and Raedt 2004) deals with
(fragments of) FOL. Conversely, (Stumme 2004) combines the notions of frequent
Datalog query and iceberg concept lattices to upgrade Formal Concept Analysis
(a well-established and widely used approach for Conceptual Clustering) to FOL.
Generally speaking, very few works on Conceptual Clustering and Concept Formation
in FOL can be found in the literature. They vary as for the approaches
(distance-based, probabilistic, etc.) and/or the representations (description logics,
conceptual graphs, E/R models, etc.) adopted. The closest work to ours is Vrain's
proposal (Vrain 1996) of a top-down incremental but distance-based method for
Conceptual Clustering in a mixed object-logical representation.

For the future we plan to extensively evaluate this approach on significantly big
and expressive ontologies. Without doubt, there is a lack of evaluation standards
in Ontology Learning. Comparative work in this field would help an ontology engineer
to choose the appropriate method. One step in this direction is the framework
presented in (Bisson et al. 2000) but it is conceived for Ontology Extraction. The
evaluation of our approach can follow the criteria outlined in (Dellschaft and Staab
2006) or criteria from the ML tradition like measuring the cluster validity (Halkidi
et al. 2001), or the category utility (Fisher 1987). Anyway, due to the peculiarities
of our approach, the evaluation itself requires preliminary work from the methodological
point of view. Regardless of performance, each approach has its own benefits.
Our approach has the advantages of dealing with expressive ontologies and
being conceptual. Such an approach, and in particular its ability to form concepts
with an intensional description in the form of a rule, can support many of the
use cases defined by the W3C Rule Interchange Format Working Group. Another
direction of future work can be the extension of the present work towards hybrid
formalisms, e.g. (Motik et al. 2004), that are more expressive than $\mathcal{AL}$-log and more
inspiring for prototypical SWRL reasoners. Also we would like to investigate other
instantiations of the framework, e.g. the ones in the case of discriminant induction
to learn predictive rules.



Appendix A The semantic mark-up language OWL 

The Web Ontology Language OWL is a semantic mark-up language for publishing
and sharing ontologies on the World Wide Web (Horrocks et al. 2003). An OWL
ontology is an RDF graph, which is in turn a set of RDF triples. As with any RDF
graph, an OWL ontology graph can be written in many different syntactic forms.
However, the meaning of an OWL ontology is solely determined by the RDF graph.
Thus, it is allowable to use other syntactic RDF/XML forms, as long as these result
in the same underlying set of RDF triples.

OWL provides three increasingly expressive sublanguages designed for use by 
specific communities of implementers and users. 

• OWL Lite supports those users primarily needing a classification hierarchy 
and simple constraints. E.g., while it supports cardinality constraints, it only 
permits cardinality values of 0 or 1. It should be simpler to provide tool support
for OWL Lite than its more expressive relatives, and OWL Lite provides
a quick migration path for thesauri and other taxonomies. OWL Lite also has 
a lower formal complexity than OWL DL. 

• OWL DL supports those users who want the maximum expressiveness while 
retaining computational completeness and decidability. OWL DL includes all 
OWL language constructs, but they can be used only under certain restric- 
tions (e.g., while a class may be a subclass of many classes, a class cannot be 
an instance of another class). OWL DL is so named due to its correspondence
with the very expressive DL SHOIN(D) (Horrocks et al. 2000) which thus
provides a logical foundation to OWL. The mapping from ALC to OWL is
reported in Table A 1.

• OWL Full is meant for users who want maximum expressiveness and the
syntactic freedom of RDF with no computational guarantees. For example, in 
OWL Full a class can be treated simultaneously as a collection of individuals 
and as an individual in its own right. OWL Full allows an ontology to augment 
the meaning of the pre-defined (RDF or OWL) vocabulary. It is unlikely that 
any reasoning software will be able to support complete reasoning for every 
feature of OWL Full. 
Each of these sublanguages is an extension of its simpler predecessor, both in what
can be legally expressed and in what can be validly concluded. 
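
Two of the restrictions mentioned above can be sketched as simple checks. This is
a deliberately simplified illustration, not a validator for any OWL sublanguage:
it assumes triples are plain Python tuples, the function names are hypothetical,
and only the two constraints quoted in the text are covered.

```python
# Simplified checks for two restrictions named in the text. Triples are plain
# tuples of strings; all resource and function names are hypothetical.

def classes_used_as_instances(triples):
    """Return declared classes that are also typed as members of a declared class.

    OWL DL forbids this situation, while OWL Full allows it.
    """
    declared = {s for (s, p, o) in triples if p == "rdf:type" and o == "owl:Class"}
    return {
        s for (s, p, o) in triples
        if p == "rdf:type" and s in declared and o in declared
    }

def lite_cardinality_ok(value):
    """OWL Lite only permits cardinality values of 0 or 1."""
    return value in (0, 1)

dl_style = {
    ("ex:Eagle", "rdf:type", "owl:Class"),
    ("ex:Species", "rdf:type", "owl:Class"),
    ("ex:harry", "rdf:type", "ex:Eagle"),
}
# Treating the class ex:Eagle as an individual of ex:Species is OWL Full style.
full_style = dl_style | {("ex:Eagle", "rdf:type", "ex:Species")}
```

On these toy graphs the first check returns an empty set for dl_style and flags
ex:Eagle for full_style.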

Table A 1. Mapping from ALC to OWL

¬C          <owl:Class>
              <owl:complementOf><owl:Class rdf:ID="C" /></owl:complementOf>
            </owl:Class>
C ⊓ D       <owl:Class>
              <owl:intersectionOf rdf:parseType="Collection">
                <owl:Class rdf:ID="C" /><owl:Class rdf:ID="D" />
              </owl:intersectionOf>
            </owl:Class>
C ⊔ D       <owl:Class>
              <owl:unionOf rdf:parseType="Collection">
                <owl:Class rdf:ID="C" /><owl:Class rdf:ID="D" />
              </owl:unionOf>
            </owl:Class>
∃R.C        <owl:Restriction>
              <owl:onProperty rdf:resource="#R" />
              <owl:someValuesFrom rdf:resource="#C" />
            </owl:Restriction>
∀R.C        <owl:Restriction>
              <owl:onProperty rdf:resource="#R" />
              <owl:allValuesFrom rdf:resource="#C" />
            </owl:Restriction>

C ≡ D       <owl:Class rdf:ID="C">
              <owl:sameAs rdf:resource="#D" />
            </owl:Class>
C ⊑ D       <owl:Class rdf:ID="C">
              <rdfs:subClassOf rdf:resource="#D" />
            </owl:Class>

a : C       <C rdf:ID="a" />
(a,b) : R   <C rdf:ID="a"><R rdf:resource="#b" /></C>
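
The concept rows of Table A 1 can be read as a recursive translation. The following
is a minimal sketch of that reading, under stated assumptions: the nested-tuple
encoding of ALC concepts and the function names to_owl and is_well_formed are
hypothetical, names are assumed to need no XML escaping, and fillers of
restrictions are limited to atomic concepts as in the table.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
OWL_NS = "http://www.w3.org/2002/07/owl#"

# Hypothetical encoding of ALC concepts as nested tuples:
#   "C"                atomic concept
#   ("not", C)         complement
#   ("and", C, D)      intersection
#   ("or", C, D)       union
#   ("some", "R", "C") existential restriction with an atomic filler
#   ("all", "R", "C")  universal restriction with an atomic filler

def to_owl(concept):
    """Serialize a concept to an OWL RDF/XML fragment, row by row as in Table A 1."""
    if isinstance(concept, str):
        return '<owl:Class rdf:ID="%s" />' % concept
    op = concept[0]
    if op == "not":
        inner = to_owl(concept[1])
        return "<owl:Class><owl:complementOf>%s</owl:complementOf></owl:Class>" % inner
    if op in ("and", "or"):
        tag = "owl:intersectionOf" if op == "and" else "owl:unionOf"
        members = "".join(to_owl(c) for c in concept[1:])
        return ('<owl:Class><%s rdf:parseType="Collection">%s</%s></owl:Class>'
                % (tag, members, tag))
    if op in ("some", "all"):
        role, filler = concept[1], concept[2]
        if not isinstance(filler, str):
            raise ValueError("this sketch only handles atomic fillers")
        tag = "owl:someValuesFrom" if op == "some" else "owl:allValuesFrom"
        return ('<owl:Restriction><owl:onProperty rdf:resource="#%s" />'
                '<%s rdf:resource="#%s" /></owl:Restriction>' % (role, tag, filler))
    raise ValueError("unknown constructor: %r" % (op,))

def is_well_formed(fragment):
    """Check that a fragment parses as XML once the rdf and owl prefixes are declared."""
    doc = ('<rdf:RDF xmlns:rdf="%s" xmlns:owl="%s">%s</rdf:RDF>'
           % (RDF_NS, OWL_NS, fragment))
    try:
        ET.fromstring(doc)
        return True
    except ET.ParseError:
        return False
```

For instance, to_owl(("and", "C", ("not", "D"))) nests the complement of D inside
the intersection collection. The check only confirms XML well-formedness, not
conformance to any OWL sublanguage.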